Generalized Biwords for Bitext Compression and Translation Spotting

Authors

  • Felipe Sánchez-Martínez
  • Rafael C. Carrasco
  • Miguel A. Martínez-Prieto
  • Joaquín Adiego
Abstract

Large bilingual parallel texts (also known as bitexts) are usually stored in a compressed form, and previous work has shown that they can be compressed more efficiently if the fact that the two texts are mutual translations is exploited. For example, a bitext can be seen as a sequence of biwords (pairs of parallel words with a high probability of co-occurrence) that can be used as an intermediate representation in the compression process. However, the simple biword approach described in the literature can only exploit one-to-one word alignments and cannot tackle the reordering of words. We therefore introduce a generalization of biwords which can describe multi-word expressions and reorderings. We also describe some methods for the binary compression of generalized biword sequences, and compare their performance when different schemes are applied to the extraction of the biword sequence. In addition, we show that this generalization of biwords allows for the implementation of an efficient algorithm to search the compressed bitext for words or text segments in one of the texts and retrieve their counterpart translations in the other text (an application usually referred to as translation spotting) with only minor modifications to the compression algorithm.
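As a rough illustration of the simple biword idea mentioned in the abstract, the following Python sketch turns a word-aligned sentence pair into a sequence of biwords and back. It is not the authors' algorithm: the function names `simple_biwords` and `split_biwords`, the use of an empty string for an unpaired word, and the example sentences are all assumptions made for this sketch. It accepts only one-to-one, non-crossing alignments and raises an error otherwise, which mirrors the limitation that motivates the generalized biwords.

```python
from typing import List, Tuple

# Hypothetical representation: (source word, target word); "" marks an unpaired side.
Biword = Tuple[str, str]


def simple_biwords(source: List[str], target: List[str],
                   alignment: List[Tuple[int, int]]) -> List[Biword]:
    """Build a biword sequence from a one-to-one, monotone word alignment.

    alignment holds (source index, target index) pairs.
    Raises ValueError for crossing or many-to-one links, which this
    simple scheme cannot represent.
    """
    biwords: List[Biword] = []
    i = j = 0  # next unconsumed position in source and target
    for ai, aj in sorted(alignment):
        if ai < i or aj < j:
            raise ValueError("alignment is not one-to-one and monotone")
        # Words skipped on either side become half-empty biwords.
        while i < ai:
            biwords.append((source[i], ""))
            i += 1
        while j < aj:
            biwords.append(("", target[j]))
            j += 1
        biwords.append((source[ai], target[aj]))
        i, j = ai + 1, aj + 1
    # Flush whatever remains after the last aligned pair.
    biwords.extend((w, "") for w in source[i:])
    biwords.extend(("", w) for w in target[j:])
    return biwords


def split_biwords(biwords: List[Biword]) -> Tuple[List[str], List[str]]:
    """Recover both texts from the biword sequence (lossless by construction)."""
    source = [s for s, _ in biwords if s]
    target = [t for _, t in biwords if t]
    return source, target


if __name__ == "__main__":
    src = ["the", "green", "house"]
    tgt = ["la", "casa", "verde"]
    seq = simple_biwords(src, tgt, [(0, 0), (2, 1)])
    print(seq)
    print(split_biwords(seq))
```

In this toy pair, linking "green" to "verde" as well would cross the "house"/"casa" link, so the sketch has to leave both words unpaired. That loss of pairing is the kind of case the abstract says the generalized biwords are designed to handle.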

Similar resources

Generalized Biwords for Bitext Compression and Translation Spotting: Extended Abstract

The increasing availability of large collections of bilingual parallel corpora has fostered the development of natural language processing applications that address bilingual tasks, such as corpus-based machine translation, the automatic extraction of bilingual lexicons, and translation spotting [Simard, 2003]. A bilingual parallel corpus, or bitext, is a textual collection that contains pairs o...

Harnessing the Redundant Results of Translation Spotting

Translation spotting consists in automatically identifying the translations of a user query inside a bitext. This task, when it relies solely on statistical word alignment algorithms, fails to achieve excellent results. In this paper, we show that identifying the translations of a query during a first translation spotting stage provides relevant information that can be used in a second stage to...

Boosting Bitext Compression

Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. This implies that when modelling bitexts one can take advantage of the fact that there exists a relation between both texts; the text alignment task allows such a relationship to be established. In this paper we propose different approaches that use words and biwords (pairs made of two words, each ...

A Two-Level Structure for Compressing Aligned Bitexts

A bitext, or bilingual parallel corpus, consists of two texts, each one in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they are used as a source of knowledge for different purposes. In this paper we propose a strategy to efficiently compress and use bitexts, saving not only space but also processing time when exploiting them. Our...

An Attentional Model for Speech Translation Without Transcription

For many low-resource languages, spoken language resources are more likely to be annotated with translations than transcriptions. This bilingual speech data can be used for word-spotting, spoken document retrieval, and even for documentation of endangered languages. We experiment with the neural, attentional model applied to this data. On phone-to-word alignment and translation reranking tasks, ...

Journal:
  • J. Artif. Intell. Res.

Volume 43  Issue 

Pages  -

Publication date 2012